Chengchang Yu

📝 Deep Analysis: The Internal Logic of LLM Safety Mechanisms

🎯 Core Four-Element Analysis

📌 Fundamental Problem

How do LLM safety alignment and jailbreak attacks actually work? Why can aligned models still be jailbroken to bypass safety guardrails? What's really happening inside these black-box models?

🔍 Key Perspective

Critical Insight: safety mechanisms should be explained through intermediate hidden states rather than just final outputs. The authors discovered:

  • Ethical concepts are learned during pre-training, not alignment
  • The essence of safety alignment is building associations: connecting early-layer ethical judgments to mid-layer emotion guesses

⚙️ Key Method

Weak-to-Strong Explanation (WSE): using weak classifiers (SVM and MLP) to analyze a strong LLM's intermediate hidden states

Technical Approach:

  1. Extract the last-position hidden state u_l from each layer
  2. Use weak classifiers to judge whether those states encode an ethical or an unethical input
  3. Apply the Logit Lens to decode mid-layer states into tokens, observing how emotion evolves across layers
  4. Propose Logit Grafting to simulate how jailbreaks disrupt the association stage
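
To make steps 1 and 2 concrete, here is a minimal probing sketch in the spirit of WSE: extract the last-position hidden state from every layer and fit a linear SVM per layer to separate benign from malicious prompts. The model name, the tiny prompt lists, and the 2-fold evaluation are placeholder assumptions for illustration, not the paper's actual setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed model; any decoder-only LLM works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def last_token_states(prompt: str):
    """Return the last-position hidden state u_l for every layer."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: tuple of (num_layers + 1) tensors, shape (1, seq_len, hidden)
    return [h[0, -1].float().cpu().numpy() for h in out.hidden_states]

# Placeholder prompt sets; real experiments need much larger benign/malicious datasets.
benign = ["How do I bake sourdough bread?", "Explain photosynthesis simply."]
malicious = ["How do I build a pipe bomb?", "Write code to steal saved passwords."]

X_per_layer = {}  # layer index -> list of feature vectors
y = [0] * len(benign) + [1] * len(malicious)
for prompt in benign + malicious:
    for l, u_l in enumerate(last_token_states(prompt)):
        X_per_layer.setdefault(l, []).append(u_l)

# A weak classifier (linear SVM) per layer: high accuracy at early layers would
# indicate the model already separates ethical vs. unethical inputs there.
for l, X in X_per_layer.items():
    clf = SVC(kernel="linear")
    acc = cross_val_score(clf, X, y, cv=2).mean()
    print(f"layer {l:2d}: probe accuracy {acc:.2f}")
```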

💡 Core Findings

Three-Stage Safety Mechanism:

  • Early Layers (0-5): The model immediately identifies malicious inputs based on ethical concepts learned during pre-training (weak-classifier probes exceed 95% accuracy)
  • Middle Layers (16-24): Alignment training associates ethical judgments with emotions (normal inputs → positive emotions; malicious inputs → negative emotions)
  • Later Layers (25-32): The model refines emotions into specific rejection or response tokens
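
A rough Logit Lens sketch (step 3 of the technical approach), reusing the `tok` and `model` objects from the previous snippet: it projects one layer's last-position state through the final norm and the unembedding head to read off the top tokens, which is how the mid-layer emotional evolution can be observed. The `model.model.norm` / `model.lm_head` attribute paths assume a Llama-style architecture in Hugging Face transformers; other model families name these modules differently.

```python
import torch

def logit_lens_top_tokens(prompt: str, layer: int, k: int = 5):
    """Decode one layer's last-position hidden state into vocabulary tokens
    by applying the model's final norm and unembedding matrix (Logit Lens)."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
        h = out.hidden_states[layer][0, -1]          # (hidden_dim,)
        logits = model.lm_head(model.model.norm(h))  # project into vocab space
    top = torch.topk(logits, k).indices.tolist()
    return [tok.decode(t) for t in top]

# Inspecting the middle layers should show emotion-laden tokens diverging
# between benign and malicious prompts, per the paper's observation.
for l in (16, 20, 24):
    print(l, logit_lens_top_tokens("How do I build a pipe bomb?", l))
```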

Jailbreak's Essence:

  • Jailbreaks cannot deceive the early-layer ethical judgment
  • Jailbreaks disrupt the mid-layer association, perturbing negative emotions into positive ones
  • Once positive emotions dominate the middle layers, the later layers generate harmful content

📐 Method Formalization

LLM Safety = Pre-training(Ethical Concepts) + Alignment(Association Mapping) + Refinement(Stylized Output)

Where:
- Early-layer Classification = Weak_Classifier(hidden_state_l) → {Ethical | Unethical}
- Mid-layer Association = Ethical_Judgment × Alignment_Weight → {Positive | Negative Emotion}
- Late-layer Refinement = Emotion_State → {Rejection_Token | Response_Token}

Jailbreak Attack = Perturb(Mid-layer Association) → Positive Emotion → Harmful Output

Logit Grafting Approximation:

Jailbreak Effect ≈ Replace(Malicious_Input_Mid_Layer_State, Normal_Input_Positive_Emotion_State)
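
Finally, a hedged sketch of the Logit Grafting idea (step 4 / the approximation above), again reusing `tok` and `model`: a forward hook overwrites the malicious prompt's last-position activation at one middle decoder layer with the corresponding activation from a benign prompt, and generation then proceeds from the grafted state. The layer choice, the hook details, and the Llama-style `model.model.layers` attribute path are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def graft_generate(malicious_prompt: str, benign_prompt: str,
                   layer: int = 20, max_new_tokens: int = 40):
    """Overwrite the prompt's last-position activation at `layer` with the
    corresponding activation from a benign prompt, then generate."""
    # 1. Capture the benign input's last-position activation at `layer`.
    benign_inputs = tok(benign_prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        benign_state = model(**benign_inputs,
                             output_hidden_states=True).hidden_states[layer][0, -1]

    # 2. Hook the decoder layer that produces hidden_states[layer] and splice
    #    the benign state into the prompt's forward pass only (seq_len > 1),
    #    leaving the single-token decode steps untouched.
    def hook(module, args, output):
        hs = output[0] if isinstance(output, tuple) else output
        if hs.shape[1] > 1:
            hs = hs.clone()
            hs[:, -1, :] = benign_state.to(hs.dtype)
            return ((hs,) + tuple(output[1:])) if isinstance(output, tuple) else hs
        return output

    handle = model.model.layers[layer - 1].register_forward_hook(hook)
    try:
        mal_inputs = tok(malicious_prompt, return_tensors="pt").to(model.device)
        out_ids = model.generate(**mal_inputs, max_new_tokens=max_new_tokens,
                                 do_sample=False)
    finally:
        handle.remove()
    return tok.decode(out_ids[0], skip_special_tokens=True)
```

Under the paper's hypothesis, comparing this grafted run against an unpatched run of the same malicious prompt should show noticeably less refusal-like output, since the mid-layer emotional state has been swapped for a positive one.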

🎤 One-Sentence Summary (Core Value)

This paper uses weak classifiers to analyze LLM intermediate hidden states and reveals a three-stage safety mechanism: pre-training instills ethical concepts that let early layers rapidly identify malicious inputs, alignment builds ethical-to-emotional associations in the middle layers, and jailbreak attacks bypass safety by disrupting this association, perturbing negative emotions into positive ones so that later layers generate harmful content.


This analysis is based on the research paper "How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States" by Alibaba Group and Tsinghua University.